In [1]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
In [19]:
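import pandas as pd  # needed here when cells run top to bottom; this cell was originally executed later
pd.set_option('display.max_colwidth', 1000)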
In [2]:
DATA_DIR = '../data/'
SEED = 12
In [3]:
import pandas as pd
In [4]:
toxicity_annotated_comments = pd.read_csv(os.path.join(DATA_DIR, 'toxicity_annotated_comments.tsv'), sep = '\t')
toxicity_annotations = pd.read_csv(os.path.join(DATA_DIR, 'toxicity_annotations.tsv'), sep = '\t')
In [5]:
annotations_gped = toxicity_annotations.groupby('rev_id', as_index=False).agg({'toxicity': 'mean'})
all_data = pd.merge(annotations_gped, toxicity_annotated_comments, on = 'rev_id')
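The aggregation and merge above can be sketched on toy data. This is only an illustration: the `rev_id` values, scores and comments below are invented, and the `toy_*` names are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the two TSV files (all values here are made up).
toy_annotations = pd.DataFrame({
    'rev_id': [1, 1, 1, 2, 2],
    'toxicity': [1, 0, 1, 0, 0],
})
toy_comments = pd.DataFrame({
    'rev_id': [1, 2],
    'comment': ['first comment', 'second comment'],
    'split': ['train', 'test'],
})

# Average the per-annotator labels so that each rev_id has a single score.
toy_gped = toy_annotations.groupby('rev_id', as_index=False).agg({'toxicity': 'mean'})

# Inner join on rev_id attaches the comment text and metadata to each score.
toy_all = pd.merge(toy_gped, toy_comments, on='rev_id')
print(toy_all)
```

Because `pd.merge` defaults to an inner join, a `rev_id` that appears in only one of the two frames is dropped from the result.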
In [6]:
all_data['comment'] = all_data['comment'].apply(lambda x: x.replace("NEWLINE_TOKEN", " "))
all_data['comment'] = all_data['comment'].apply(lambda x: x.replace("TAB_TOKEN", " "))
# TODO(nthain): Consider doing regression instead of classification
all_data['is_toxic'] = all_data['toxicity'] > 0.5
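A small sketch of the cleaning and labelling step on invented rows. It uses `Series.str.replace` with `regex=False` as a vectorised alternative to the `apply`/`lambda` calls; that substitution is a suggestion, not what the notebook runs.

```python
import pandas as pd

# Invented rows; only the placeholder tokens match the real data.
toy = pd.DataFrame({
    'comment': ['helloNEWLINE_TOKENworld', 'aTAB_TOKENb', 'plain'],
    'toxicity': [0.5, 0.6, 0.0],
})

# regex=False treats each token as a literal string rather than a pattern.
for token in ['NEWLINE_TOKEN', 'TAB_TOKEN']:
    toy['comment'] = toy['comment'].str.replace(token, ' ', regex=False)

# Strict inequality: a mean score of exactly 0.5 is labelled non-toxic.
toy['is_toxic'] = toy['toxicity'] > 0.5
print(toy)
```

Note the boundary case: with `> 0.5`, a comment that exactly half of the annotators marked as toxic ends up with `is_toxic == False`.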
In [7]:
# split into train, test, dev using the dataset's own 'split' column
wiki_splits = {}
for split in ['train', 'test', 'dev']:
    wiki_splits[split] = all_data.query('split == @split')
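The `query` call relies on two different meanings of `split`: the bare name is the DataFrame column, while `@split` is the Python loop variable. A minimal sketch on invented rows:

```python
import pandas as pd

# Invented rows with the same 'split' labels as the dataset.
toy = pd.DataFrame({
    'rev_id': [1, 2, 3, 4],
    'split': ['train', 'train', 'dev', 'test'],
})

toy_splits = {}
for split in ['train', 'test', 'dev']:
    # 'split' is the column; '@split' is the loop variable defined above.
    toy_splits[split] = toy.query('split == @split')

# The subsets cover every row as long as each row carries one of the three labels.
total_rows = sum(len(df) for df in toy_splits.values())
print(total_rows, len(toy))
```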
In [8]:
#for split in wiki_splits:
#    wiki_splits[split].to_csv(os.path.join(DATA_DIR, 'wiki_%s.csv' % split), index=False)
In [9]:
def augment_with_data(source_df, target_path, target_name, sep = '\t', write = False):
    # Read the extra data (honouring the sep argument) and tag its rows with target_name.
    target_df = pd.read_csv(target_path, sep = sep)
    target_df['sample'] = target_name
    target_splits = {}
    for split in source_df:
        # Append the matching split of the extra data, then shuffle reproducibly.
        target_splits[split] = pd.concat([source_df[split],
                                          target_df.query('split == @split')]).sample(frac = 1, random_state = SEED)
        if write:
            target_splits[split].to_csv(os.path.join(DATA_DIR, 'wiki_%s_%s.csv' % (target_name, split)), index=False)
    return target_splits
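The core of `augment_with_data` is a concatenate-then-shuffle step for each split. The sketch below shows that step in memory, without reading a file; the `source` and `extra` frames are invented stand-ins.

```python
import pandas as pd

SEED = 12  # same seed value as the notebook

# Invented stand-ins for one wiki split and for the extra data file.
source = pd.DataFrame({'comment': ['a', 'b', 'c'],
                       'sample': ['random'] * 3,
                       'split': ['train'] * 3})
extra = pd.DataFrame({'comment': ['x', 'y'], 'split': ['train', 'test']})
extra['sample'] = 'debias'

# Keep only the extra rows for this split, append them, then shuffle all rows.
combined = pd.concat([source, extra.query('split == "train"')])
shuffled = combined.sample(frac=1, random_state=SEED)

# Shuffling again with the same seed gives the same row order.
reshuffled = combined.sample(frac=1, random_state=SEED)
print(shuffled['comment'].tolist() == reshuffled['comment'].tolist())
```

`pd.concat` keeps each frame's original index, so index labels can repeat in the result. If unique labels matter, passing `ignore_index=True` to `pd.concat` or calling `reset_index(drop=True)` afterwards avoids that.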
In [10]:
debias_splits = augment_with_data(wiki_splits, '../data/toxicity_debiasing_data.tsv', 'debias')
In [11]:
wiki_splits['train'].shape
Out[11]:
(95692, 9)
In [12]:
debias_splits['train'].shape
Out[12]:
(99157, 9)
In [13]:
random_splits = augment_with_data(wiki_splits, '../data/toxicity_debiasing_data_random.tsv', 'debias_random')
In [14]:
random_splits['train'].shape
Out[14]:
(99157, 9)
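The shapes above imply how many rows each augmentation adds to the training split. A quick check of that arithmetic, using only the numbers printed in Out[11], Out[12] and Out[14]:

```python
wiki_train_rows = 95692        # Out[11]
augmented_train_rows = 99157   # Out[12] and Out[14]

# Rows contributed by the extra data file to the train split.
added_rows = augmented_train_rows - wiki_train_rows

# Size of the addition relative to the original training set.
relative_increase = added_rows / wiki_train_rows
print(added_rows, relative_increase)
```

Both augmented training sets have the same number of rows, so the two extra files add the same number of training examples (3465).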
In [21]:
random_splits['train'].query('sample == "random"')
Out[21]:
comment
is_toxic
logged_in
ns
rev_id
sample
split
toxicity
year
25577
` == The Halo's RfA == {| |- | |valign=top|Mark Dingemanse...Thank you very much for the constructive criticism in your oppose comment in my request for adminship. Ultimately, no consensus was reached, and I failed to be promoted, but I am very grateful for your coments. I will strive to better myself in all areas, especially Mainspace. |} `
False
True
user
7.7269e+07
random
train
0.0
2006.0
150589
: You've changed your point several times. Sorry if I missed it.
False
True
article
6.42874e+08
random
train
0.0
2015.0
144969
` *Neither admins nor non-admins should count votes, as you and I both already know. No actual case for either having primary topic was made. `
False
True
user
6.1052e+08
random
train
0.0
2014.0
41923
It's interesting that this IP feels they know me so well, when I have had limited interaction with you. We have never actively engaged each other on talk pages. It's amazing you somehow have been able to edit every page I edit; every photograph you crop is mine. Have you wondered where this ire against me and my work has come from? You continually put a photo of a man in a wig (a photoshopped wig, at that) on the Afro page, and four different editors have removed it, five if you include me. This is what you consider ownership and edit warring? Then you have taken my photographs and decided to rename them simply to remove David Shankbone out of the file name. Your intentions are no pure. You are gaming the policies and guidelines to Wikistalk me. You say not to bite the newcomers, but I haven't actively engaged you. If you are a newcomer, how come you have such a handy knowledge of Wiki policies and guidelines? Why is every page in your history one I have contributed to...
False
True
user
1.33966e+08
random
train
0.0
2007.0
134484
:::::I guess it comes down to substantial and my experience with the Danish version of it being being a bit more loosely defined (such as there is no actual demand of the ball being a tennis ball or who throws the ball).
False
True
article
5.49524e+08
random
train
0.0
2013.0
59535
` Please stop. If you continue to vandalize Wikipedia, you will be blocked from editing. `
False
True
user
1.95916e+08
random
train
0.2
2008.0
39227
` :I removed them. `
False
True
article
1.23612e+08
random
train
0.0
2007.0
104536
` == Stray punctuation == FYI, I believe this is now fixed per your query at . Thanks! `
False
True
user
3.81579e+08
random
train
0.0
2010.0
83332
REDIRECT Talk:List of diplomatic missions of Switzerland
False
True
article
2.8586e+08
random
train
0.0
2009.0
3429
More Dutch speakers == I would say about 23 million people speak Dutch rahter than 20 million: * Netherlands 16 million * Belgium 6 million * Suriname, Antilles, other communities 1 million * total 23 million [anonymous] I don't know where you get those numbers, where comes the 1 million number? population: * Antilles 212,226 * Aruba 103,000 * Suriname 438,144 the population of these three regions doesnt reach one million, and Dutch is definetly not a national language in these places. Plus, the Netherlands and Belgium have a lot of immigrants. So I belive more in the 20 million figure, rather than the 23 million one.- 5 July 2005 11:27 (UTC) :There are for example probably about 12 million native speakers in the Netherlands, once you sutract immigrants, Frisian, Limburgish, and other Germanic language speakers. Suriname definitely isn't natively Dutch speaking; some of the languages ...
False
True
article
1.82048e+07
random
train
0.0
2005.0
81674
what the hell i think im a vampire hence the name and im a schoolgirl so dont diss. Btw i am cleaning this up i read all the books and am obsessed. TwilightVampire4Ever
False
True
article
2.78791e+08
random
train
0.3
2009.0
117111
:I created this account because my other one got blocked. There's no rule against that. Yes there is; it's called block evasion. As far as me having something against you personally, I've never heard of you before or seen anything of you before you put up the unblock request; as a checkuser, my tools include the ability to check to see if people are, as you were, abusing multiple accounts.
False
True
user
4.45742e+08
random
train
0.1
2011.0
3836
*Willmcw: Okay, but I don't undersatnd what you mean about mirrors and links? Could you elaborate, just for my edification? Also, what would you have me do? You are deleting my personal name; so is it okay, if this happens, if I add at least an external link to my site, or do you have some vendetta against me? I think you have the wrong impression of me. I would like to make a fresh start with you and be on good terms (and no, I don't expect you to change your vote to delete my entry). I am willing to try to make it clear that I am sincere and in good faith. Please let me know what I can do to demonstrate that to your satisfaction. Yesterday I indeed started a few entries, and am enjoying this. I want to do it right, but am an amateur at some of the rules and policies and editing techniques. Sincerely, Stephan Kinsella
False
True
user
1.92391e+07
random
train
0.0
2005.0
98293
` == Technical Problems == There is no mention in the article about the Yellow Light of Death (YLOD) which indicates a general hardware failure on the PS3. There is ample discussion of this matter and a wide body of citations available; Currently 9M returns for ``yellow light of death`` http://search.live.com/results.aspx?q=yellow+light+of+death&form;=QBLH&filt;=all and 1.9M returns for ``yellow light of death`` http://www.google.ca/search?hl=en&q;=yellow+light+of+death&meta;=&aq;=f&oq;= There is also zero mention the bricking problems Sony has while updating PS3 firmware. There is ample discussion of this matter and a wide body of citations available, this is but one; http://gizmodo.com/5021399/playstation-3-firmware-24-bricking-some-ps3s Something should be mentioned in a new section, titled ``Technical Problems``. :Addition of ``YLOD`` has been discussed here in the past and the main reason it hasn't been included is that there weren't any reliable, notable so...
False
True
article
3.52411e+08
random
train
0.1
2010.0
112105
:* I've tried to reword and reduce this bit
False
True
article
4.19913e+08
random
train
0.0
2011.0
112882
Okay, thank you, you just clearly violated the civility rule by calling me stupid, that's just a bonus.
False
True
user
4.23551e+08
random
train
0.3
2011.0
93403
:In context that he was a founder-member, 'played with' is fine to mean he was still a member: if wished you could clarify by saying 'he continued to play with them...' but it would be a bit long-winded. The second 'Melos' bluelink is entirely optional: the article isn't long enough to make it really necessary, but repeat bluelinks from the intro paragraph into the main text are not uncommon and do sometimes help. I can't decide if 'premiere' for the Britten works should be singular or plural. A footnote source for the appointments at Michigan would help, as this is told 'on trust'.If there is more to add, it would be nice to draw the sentences together a little more into two or three paragraphs. But these are minutiae, quite at your discretion....! - all is well.
False
True
user
3.29427e+08
random
train
0.0
2009.0
102209
== Trade to Wizards == On NBA draft night the bulls traded Kirk Hinrich to the wizards for the 17th pick in the nba draft and a future 2nd round pick. see story here http://voices.washingtonpost.com/wizardsinsider/2010/06/report-wizards-trade-for-kirk.html
False
True
article
3.71123e+08
random
train
0.0
2010.0
157610
Ca'Foscari University - History of English Culture Project Hi everyone! I'm a postgraduate student from Ca' Foscari University, Venice. As an assignement for my History of English Culture course I have chosen to expand this article. As a provisional bibliography I have selected these sources: Tim Hitchcock and Robert Shoemaker, Tales from the Hanging Court, London, Hodder Arnold, 2007 www.oldbaileyonline.org If you have any suggestions or advice, I would really appreciate it. Feel free to talk to me here or on my talk page!
False
True
article
6.86479e+08
random
train
0.0
2015.0
107412
::: If we're not permitted to discuss the subject matter, then I will cease. It is disappointing though that there is an alternative interpretation of the halting problem proof which isn't represented in the article.
False
False
article
3.95864e+08
random
train
0.0
2010.0
142178
==DYK nomination of The FP== Hello! Your submission of The FP at the Did You Know nominations page has been reviewed, and some issues with it may need to be clarified. Please review the comment(s) underneath your nomination's entry and respond there as soon as possible. Thank you for contributing to Did You Know! v/r -
False
True
user
5.95401e+08
random
train
0.0
2014.0
127841
Your city looks very pretty from the wiki images on the main article. I don't how far I come, if so i'll just place on hold till the 10th
False
True
user
5.05194e+08
random
train
0.0
2012.0
113152
` :The article just says he's a person that has worked for a number of companies. –– `
False
True
user
4.24933e+08
random
train
0.0
2011.0
134282
` ==Your warning== You might want to redirect that warning to the person who created the page and added the content ;) `
False
True
user
5.48051e+08
random
train
0.0
2013.0
129454
` : Dues to SUL conflict! `
False
True
user
5.15392e+08
random
train
0.0
2012.0
140837
` :That sounds like Cohen v. California. `
False
True
article
5.87635e+08
random
train
0.0
2013.0
37836
` ==A view of a Japanese columnist== A conservative columnist, Hideaki Kase writes in NewsWeek:The fact is that the brothels were commercial establishments. U.S. Army records explicitly declare that the comfort women were prostitutes, and found no instances of ``kidnapping`` by the Japanese authorities. It's also worth noting that some 40 percent of these women were of Japanese origin.`
False
True
article
1.18503e+08
random
train
0.3
2007.0
117143
` ==Sockpuppetry case== {| align=``left`` style=``background: transparent;`` || |} Your name has been mentioned in connection with a sockpuppetry case. Please refer to Wikipedia:Sockpuppet investigations/Jaimatadi000 for evidence. Please make sure you make yourself familiar with the guide to responding to cases before editing the evidence page. `
False
True
user
4.45986e+08
random
train
0.1
2011.0
34091
` :Please go right now and read Wikipedia policy. You must treat other users with respect and not address them in such a confrontational tone. Do not insinuate that they are morons for taking certain positions on this issue. :Now, to your point. No one is claiming that the image of Muhammad is any more ``real`` than the image of Qin Shi Huang. They are both done in later times to represent an historical figure. Muslims do have a tradition of representing the prophet in images even if it is not the most prevalent. What the image on the page represents is part of the tradition that Muslims have in representing Muhammad. Now, feel free to enter into discussion about whether or not you believe such images are important enough to warrant inclusion on Muhammad but make sure you are arguing about depictions in Muslim tradition and not merely your own personal feelings. Also note that Depictions of Muhammad (AVOID if you will be offended) has various images of MuhammadMuslim and n...
False
True
article
1.05739e+08
random
train
0.1
2007.0
62361
== NO == NO SOUP FOR YOU sorry - couldn't resist!
False
True
user
2.04994e+08
random
train
0.1
2008.0
...
...
...
...
...
...
...
...
...
...
103637
:::Thank you for that reference, I think it's assisted me greatly. Some of the things that it suggests should be there are obviously not applicable (there's no plot!) but overall I think it's given me a lot of help with the tone. I've now re-written the lead keeping these principles and your comments in mind and I think it's an improvement, what do you think?
False
True
article
3.77424e+08
random
train
0.0
2010.0
138908
::Hi, adding my thoughts here. While you are welcome to your opinion of Manning's gender, it is not germane to the title discussion, and the guidelines we developed in advance of the discussion suggested that we keep such opinions to ourselves. That said, I would ask those writing here to consider dropping similar notes on the pages of those who share their opinion that Manning is a woman, as that is not a valid argument for a title change either.
False
True
user
5.75297e+08
random
train
0.1
2013.0
66939
` :::let me point out first that you're both falling into a very common pattern here. I've seen it a million times, in and out of Wikipedia: when a dispute is right on the verge of being resolved, frustrations and grumpinesses start to rise - not from the dispute per se, but from the kind of mild insulted feeling that's always a part of having a dispute. don't let it get ya. -) :::DGT, the ISBN is 1-58705-040-4. :::don't worry about a PDF version - Jeremy wasn't responding to the discussion that we were having directly, but just speaking in general. I think if we can all hang in with the Good Faith beliefs for just a little bit more while Lee de-commercializes the HTML version, we can let this discussion go and move on to bigger and better things. `
False
True
article
2.20873e+08
random
train
0.0
2008.0
148964
` ::::: :::::That is fine with me. AFAICT, the tool silently ignores references it cannot parse, so a simpler alternative would be to list those references along with the ones that resulted in HTTP errors. ::::: `
False
False
user
6.32827e+08
random
train
0.0
2014.0
112024
` ``Cattle class`` is another (possibly regional, definitely slang) nickname for economy class. Might or might not be worth including. And check the link for Via Rail, it is a Canadian company, not American. There isn't any routes in the US either.`
False
False
article
4.19529e+08
random
train
0.0
2011.0
15804
` :You will need to provide link diffs in the evidence section. Миборовский |||M|E|! `
False
True
user
5.26015e+07
random
train
0.0
2006.0
83283
== Easter Egg? == decided to leave it in and make it an easter egg.[citation needed] AFAIR, it's not an easter egg, as you simply GO INTO the next level, when you've won the first level, right ?
False
True
article
2.85591e+08
random
train
0.0
2009.0
39246
Paradox? Don't you mean conundrum. a paradox is something I thought to be impossible!! And this is not.
False
False
article
1.23676e+08
random
train
0.0
2007.0
44248
WikiProject Cold War history}} {{
False
True
article
1.42515e+08
random
train
0.0
2007.0
56128
` You'll do Whatever it takes to change the subject of why you and kos should be banned. ``D*ck`` is vandalism. It's just the politics here are fucking awful. Go ahead. Change the subject again. Whatever it takes to change the subject. Comedy. KOS was trying to justify the ban, in a debate, but in the middle I got banned. He couldnt come up with a reply so he banned me.. Comedy. Then he tried again, but I replied and got banned. Go back to suckin his dick k.`
True
False
user
1.8565e+08
random
train
0.9
2008.0
43774
` ::::Oh, I thought that this might have come into play: Others' comments It is not necessary to bring talk pages to publishing standards, so there is no need to correct typing errors, grammar, etc. It tends to irritate the users whose comments you are correcting. Never edit someone's words to change their meaning. Editing others' comments is not allowed. There are exceptions, however. Some are: * If you have their permission * Removing prohibited material such as libel and personal details * Removing personal attacks and incivility. This is controversial, and many editors do not feel it is acceptable; please read WP:ATTACK#Removal of text and WP:CIVIL#Removing uncivil comments before removing anything. * Unsigned comments: You are allowed to append or one of its variants to the end of someone's comment if they have failed to sign it. The form is {{subst:unsigned|USER NAME OR IP}}, which results in —The preceding unsigned comment was added by • .. * Interruptions: In some...
False
True
user
1.4068e+08
random
train
0.0
2007.0
8108
== Hello == Thanks for you greeting message. I became intereted in wiki because I wanted to edit my Oxford college page - thought it was too drab and I cant bear it any longer. I am sure you will be a big help to a new comer like me. Thanks a bunch in advance! - Mike
False
True
user
3.24775e+07
random
train
0.0
2005.0
124001
::Does that mean that within the Occupied Territories, 502 of the 515 criminal suits that year related to right wing Jewish settlers, or within Israel at large, 502 of the 515 criminal suits that year related to right wing Jewish settlers in the Occupied Territories?Best Wishes
False
True
article
4.83901e+08
random
train
0.0
2012.0
25626
Superfluous - what's a University for if not academic achievments, football?
False
False
article
7.73992e+07
random
train
0.0
2006.0
80116
:This is about you, not about other users. You are very, very fast to apply tu quoque fallacies when you're called on your behaviour. I have advised several editors previously about the recommendations of the above-mentioned essay, and will advise others if I see them engaging in similar behaviour. Please consider the recommendation of Matthew 7:5.
False
True
user
2.72511e+08
random
train
0.1
2009.0
47966
` == Lead section == I believe that the following sentence: :``When current is applied to the capacitor, electric charges of equal magnitude, but opposite polarity, build up on each plate.`` is more accurate than this version: :``When charge is made to flow into the capacitor, electric charges of equal magnitude, but opposite polarity, build up on each plate.`` You said in your edit summary that ``charge flows, current is a flow``, implying that current doesn't flow (verb). You are correct, so it's a good thing the original version didn't claim that ``current flows``! Also, the phrase ``charge is made to flow into the capacitor`` isn't really accurate, as overall, charge neither leaves nor enters the capacitor. Regards, `
False
True
article
1.5637e+08
random
train
0.0
2007.0
73646
Water is densest at 4 degrees Celsius, but it’s not that much denser than water at other temperatures. Exploiting tiny differences in the density of water of different temperatures would be a very inefficient way of producing power. The only ways I know of that produce significant amounts of power from cold water are hydroelectric power (dams), tidal power, wave power, and power from ocean currents (analogous to wind power). Each of these involves falling water or water that is moving somewhat rapidly.
False
True
article
2.46234e+08
random
train
0.0
2008.0
15926
:Hi Deskana. I have replied to Jitse on his talk page; take a look. ×
False
True
user
5.30151e+07
random
train
0.0
2006.0
95860
== David Carradine == You should realize what is vandalism and what isn't. Accusing me of vandalism is not constructive.
False
True
user
3.40808e+08
random
train
0.1
2010.0
80752
end quoted material.
False
True
article
2.751e+08
random
train
0.0
2009.0
32758
In your recent edit, you say that you are a real band, but are new to Wikipedia. Allow me to point out a few areas where you could improve. *All information in an article must be true, and verified by outside sources. I seriously doubt that this band sold 999,727,163 of anything, but you are welcome to add a link to a Rolling Stone article about this astounding achievement. *An article shouldn't be created by someone directly involved in the subject- for example, a band member shouldn't create an article about her own band. -
False
True
user
1.00875e+08
random
train
0.0
2007.0
106568
:::To me it has only two: a nation and open fields with grazing cows and sheep instead of smog, crowded streets and cars.
False
True
user
3.91449e+08
random
train
0.2
2010.0
8667
`Biblical allusions== I was thinking of putting this into the liteary allusions sections, but I'd like some comments and a quote check first. ``The film makes a number of seemingly biblical references, intrestingly it is mostly the Sith who make them. These possible references include: ``It is done`` ``I do not know you anymore`` ``If you are not with us, you are against us`` ( Though the last one might be more of a political reference to George Bushes use of the same quote).`` == `
False
False
article
3.38121e+07
random
train
0.1
2006.0
16520
` Fine, gentlemen, but as it appears it's only ourselves having this ``discussion``, allow me to submit this for your inspection, as sent by Turner May 19 to all media, the TPS and its Board (blocked by the latter two) and yes, Fantino himself, in response to the latest headlines. If you ever change your minds... Original to Leo Kinahan, counsel for Sgt. Jim Cassells Mr. Kinahan: Having discovered how sensitive you are about your e-mail address, note that I have included you as a BCC to protect your privacy. As I have been battling with the media ever since my saga began, I am only including them, at this time, in the hope that - considering the recent brave actions of your client and the resultant headlines which have followed - they may, finally, be forced to recognize my plight and resultant complaint since it all began. As briefly as I can describe it, I was set up by 55 Division in November of ' 98 with a manufactured charge of Criminal Harassment as a favour to their ...
False
False
article
5.4644e+07
random
train
0.0
2006.0
68912
==Proposed deletion== Delete - body of work doesn't justify any indication of notability plus Kyle doesn't want it here.
False
True
article
2.27893e+08
random
train
0.0
2008.0
103237
As I see the same quote is also present in this book.
False
True
article
3.75422e+08
random
train
0.0
2010.0
125131
` == Romney (neologism) == Is there reason to believe that this coined definition of ``Romney`` is anything more than a bit of non-notable WP:NOTNEWS? There is one article linked to CNN, which is of course a reliable source, but that doesn't mean that this is necessarily a definition which can be considered notable. I think there might be WP:BLP issues as well, though I am not as well versed as I should be in BLP policies so someone can correct me if I'm wrong...actually I think I'll ask because he knows his BLP policy well. In the Santorum case, Dan Savage was at least a notable person to begin with, but who is Jack Shepler and why is anything he says notable? :Looked into the issue and would have to agree with the points you make - I'm not seeing that the attempt at creating a neologism ended up being at all notable. ::Yes, no long term notability - this whole article has partisan attack issues. Its incredibly bloated and wants stripping to a couple of sentences and ...
False
True
article
4.90454e+08
random
train
0.0
2012.0
66934
` ::``Troll``: I'm sure, but what has happened and will continue to happen if is exactly why this kite is not yet ready to fly. You must have concensus, and it does not yet exist! `
False
False
article
2.2086e+08
random
train
0.1
2008.0
32852
== What about rsync.net? == Anyone mind if rsync.net is added to the Managed backup service providers list? I was going to add it but I didn't know the wiki naming convention: RsyncDotNet or Rsync_Net_(website) or ...?
False
True
article
1.01258e+08
random
train
0.0
2007.0
132886
*And just to be clear, an interview with an anonymous person claiming to be an Alawite during a time of conflict isn't exactly reliable and scholarly, when the purpose of said interview is demonisation of a religious group.
False
True
article
5.36603e+08
random
train
0.0
2013.0
48395 rows × 9 columns
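Rather than printing every matching row, the composition of a split can be summarised per `sample` label. A sketch on invented rows; the label values and the `toy_train` name are only illustrative.

```python
import pandas as pd

# Invented rows; 'sample' records where each row came from.
toy_train = pd.DataFrame({
    'sample': ['random', 'random', 'debias_random', 'random'],
    'is_toxic': [False, True, False, False],
})

# Number of rows per source.
sample_counts = toy_train['sample'].value_counts()

# Fraction of rows labelled toxic within each source.
toxic_rate = toy_train.groupby('sample')['is_toxic'].mean()
print(sample_counts)
print(toxic_rate)
```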
Content source: conversationai/unintended-ml-bias-analysis